A Maximum Entropy Approach to Identifying Sentence Boundaries

نویسندگان

  • Jeffrey C. Reynar
  • Adwait Ratnaparkhi
چکیده

We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of . , ?, and / as either a valid or invalid sentence boundary. The training procedure requires no hand-crafted rules, lexica, part-of-speech tags, or domain-specific information. The model can therefore be trained easily on any genre of English, and should be trainable on any other Romanalphabet language. Performance is comparable to or better than the performance of similar systems, but we emphasize the simplicity of retraining for new domains. 1 I n t r o d u c t i o n The task of identifying sentence boundaries in text has not received as much attention as it deserves. Many freely available natural language processing tools require their input to be divided into sentences, but make no mention of how to accomplish this (e.g. (Brill, 1994; Collins, 1996)). Others perform the division implicitly without discussing performance (e.g. (Cutting et al., 1992)). On first glance, it may appear that using a short list of sentence-final punctuation marks, such a s . , ?, and /, is sufficient. However, these punctuation marks are not used exclusively to mark sentence breaks. For example, embedded quotations may contain any of the sentence-ending punctuation marks and . is used as a decimal point, in email addresses, to indicate ellipsis and in abbreviations. Both / and ? are somewhat less ambiguous * The authors would like to aclmowledge the support of ARPA grant N66001-94-C-6043, ARO grant DAAH0494-G-0426 and NSF grant SBR89-20230. but appear in proper names and may be used multiple times for emphasis to mark a single sentence boundary. Lexically-based rules could be written and exception lists used to disambiguate the difficult cases described above. However, the lists will never be exhaustive, and multiple rules may interact badly since punctuation marks exhibit absorption properties. Sites which logically should be marked with multiple punctuation marks will often only have one ((Nunberg, 1990) as summarized in (White, 1995)). For example, a sentence-ending abbreviation will most likely not be followed by an additional period if the abbreviation already contains one (e.g. note that D. C is followed by only a single . in The president lives in Washington, D.C.). As a result, we believe that manually writing rules is not a good approach. Instead, we present a solution based on a maximum entropy model which requires a few hints about what information to use and a corpus annotated with sentence boundaries. The model trains easily and performs comparably to systems that require vastly more information. Training on 39441 sentences takes 18 minutes on a Sun Ultra Sparc and disambiguating the boundaries in a single Wall Street Journal article requires only 1.4 seconds. 2 P r e v i o u s W o r k To our knowledge, there have been few papers about identifying sentence boundaries. The most recent work will be described in (Pa.lmer and Hearst, To appear). There is also a less detailed description of Pahner and Hearst's system, SATZ, in (Pahuer and Hearst, 1994). 1 The SATZ architecture uses either a decision tree or a neural network to disambiguate sentence boundaries. The neural network achieves 98.5% accuracy on a corpus of Wall Str'eet Journal t~Ve recommend these articles for a more comprehensive review of sentence-boundary identification work than we will be able to provide here.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Maximum Entropy Approach to Sentence Boundary Detection of Vietnamese Texts

We present for the first time a sentence boundary detection system for identifying sentence boundaries in Vietnamese texts. The system is based on a maximum entropy model. The training procedure requires no hand-crafted rules, lexicon, or domain-specific information. Given a corpus annotated with sentence boundaries, the model learns to classify each occurrence of potential end-of-sentence punc...

متن کامل

A Maximum Entropy Approach to Identifying

We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries , our model learns to classify each occurrence of ., ?, and ! as either a valid or invalid sentence boundary. The training procedure requires no hand-crafted rules, lex-ica, part-of-speech tags, or domain-speciic information. The model can therefore be trained easily o...

متن کامل

Utterance Segmentation Using Combined Approach Based on Bi-directional N-gram and Maximum Entropy

This paper proposes a new approach to segmentation of utterances into sentences using a new linguistic model based upon Maximum-entropy-weighted Bidirectional N-grams. The usual N-gram algorithm searches for sentence boundaries in a text from left to right only. Thus a candidate sentence boundary in the text is evaluated mainly with respect to its left context, without fully considering its rig...

متن کامل

Senseval automatic labeling of semantic roles using Maximum Entropy models

As a task in SensEval-3, Automatic Labeling of Semantic Roles is to identify frame elements within a sentence and tag them with appropriate semantic roles given a sentence, a target word and its frame. We apply Maximum Entropy classification with feature sets of syntactic patterns from parse trees and officially attain 80.2% precision and 65.4% recall. When the frame element boundaries are give...

متن کامل

Discriminative Training and Maximum Entropy Models for Statistical Machine Translation

We present a framework for statistical machine translation of natural languages based on direct maximum entropy models, which contains the widely used source-channel approach as a special case. All knowledge sources are treated as feature functions, which depend on the source language sentence, the target language sentence and possible hidden variables. This approach allows a baseline machine t...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1997